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Abstract — We consider the problem of identifying a pattern of 
faults from a set of noisy linear measurements. Unfortunately, 
maximum a posteriori probability estimation of the fault pattern 
is computationally intractable. To solve the fault identifica- 
tion problem, we propose a non-parametric belief propagation 
approach. We show empirically that our belief propagation 
solver is more accurate than recent state-of-the-art algorithms 
including interior point methods and semidefinite programming. 
Our superior performance is explained by the fact that we take 
into account both the binary nature of the individual faults and 
the sparsity of the fault pattern arising from their rarity. 

Index Terms — Fault diagnosis, stochastic approximation, mes- 
sage passing, belief maintenance 



I. Introduction 

THE fault identification problem is the task of determining 
which of a set of possible failures (faults) have occurred 
in a system given observations of its behavior Such problems 
arise in a variety of computer-based engineering applications, 
including aerospace UJ, ||2J, industrial process control |3 |, and 
automotive systems ||4l. 

A common approach to fault identification is to exploit 
a mathematical model of the system to construct residuals 
(deviations from the system's expected behavior), which can 
be used to detect and diagnose atypical behavior A wide 
variety of approaches have been proposed for constructing 
residual measurements. For example, state space models of 
the system may be used to generate parity checks, or the 
magnitude of errors in a Kalman filter may be used as an 
observer to detect changes in behavior The occurrence of a 
particular fault results in a characteristic change, or signature, 
which can be used to identify which fault or set of faults has 
occurred. For an overview and survey of residual methods, see 
Frank |5|. 

In this work we shall consider a typical model of fault 
measurement that describes the presence of each possible fault 
using a binary variable, and assumes that each fault affects 
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the observed system measurements in a linear additive way. 
Frequently there are far more potential faults than measure- 
ments, resulting in an ill conditioned set of linear equations. 
By assuming that faults are relatively rare, one can construct 
a combinatorial problem to identify the most likely pattern of 
faults given the system measurements. 

Systems of binary faults have been extensively studied in the 
computer science community, for example posing the problem 
as constraint satisfaction and using heuristic search techniques 
||6l, Q. An alternative approach is to use a convex relaxation 
of the original combinatorial problem, for example using the 
interior point method of Zimnis et al. 1 8 1 . 

A key assumption of the current work is that the fault 
signature matrix, which is the linear transformation matrix 
applied to the fault vector, is sparse. This is the case in many 
applications, where each fault may affect only a small part 
(such as a single subsystem) of the whole ||9|, lITOll . ifTTI . 
lfT2l . This assumption is especially valid in large scale systems 
where different faults affect different parts of the system. 

The fault identification problem is closely related to several 
coding and signal reconstruction problems. For example, in 
multi-user detection for code division multiple access (CDMA) 
systems ifTSll . lfT4]| . ifTSll . the system measurements are given 
by the received noisy wireless signal and the goal is to 
estimate the transmitted bit pattern, which plays the role of 
the fault pattern. One important difference is that in multi-user 
detection, each bit typically has an equal probability of being 
or 1, whereas in fault identification the prior probability that 
a bit is 1 (indicating that the fault is present) is typically much 
lower 

Compressed sensing (CS) problems are also closely related 
to fault identification. Informally, CS reconstructs a sparse 
signal from a set of noisy linear measurements of the signal. 
As in fault identification, the system of linear equations is ill 
conditioned, and the assumption of sparsity (corresponding to 
the rarity of faults) is critical in the reconstruction. Because 
of this similarity between CS and fault identification, in the 
experimental results section we compare the performance of 
several CS algorithms with previously proposed methods for 
fault identification. 

In this work, we develop a novel approach for fault iden- 
tification based on a variant of the belief propagation (BP) 
algorithm called non-parametric belief propagation (NBP). 
Belief propagation approaches have been applied to solve 
many similar discrete, combinatorial problems in coding, 
and variants of NBP (which allows the algorithm to reason 
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about real-valued variables) have been used in compressed 
sensing (16] and low-density lattice codes (codes defined over 
real-valued alphabets) ifTTl . lITSI , [19|. Our method constructs 
a relaxation of the fault pattern prior using a mixture of 
Gaussians, which takes into account both the binary nature 
of the problem as well as the sparsity of the fault pattern. 

Using an extensive experimental study, we show that our 
approach provides the best performance in identifying the 
correct fault patterns when compared to recent state-of-the-art 
algorithms, including interior point methods and semidefinite 
programming. To demonstrate the importance of each compo- 
nent of our model, we compare both to existing approaches for 
fault identification as well as to compressed sensing algorithms 
and a discrete BP formulation. We explain how to implement 
an efficient quantized version of NBP, and additionally provide 
the source code used in our experiments ll20l . 

The structure of this paper is as follows. Section intro- 
duces the fault identification problem in terms of maximum a 
posteriori (MAP) estimation, and describes graphical models 
and the BP algorithm. Section |III] presents our solution, based 
on NBP. Section |IV] explains implementation details and 
optimizations. Section [V] compares the accuracy of several 
state-of-the-art methods for fault identification, and shows that 
our proposed method has the highest accuracy. We shed light 
on the favorable performance of NBP from an information 
theoretic perspective in Section IVII and conclude with a 
discussion in Section NTH 

II. Fault Identification Problem 

In this section we describe the fault model in detail, and 
the basic MAP approach for estimating the fault pattern. 
Our goal is to infer the maximum a posteriori (MAP) fault 
value, the fault pattern most likely to have occurred given 
a set of observations. We then briefly review probabilistic 
graphical models and the beUef propagation algorithm within 
this context. 

A. Fault model and prior distribution 

We consider a system in which there are n potential 
faults, any combination of which (2" in total) can occur A 
fault pattern, i.e., a set of faults, is represented by a vector 
X e {0, 1}", where Xs = 1 means that fault s , s e {1, • • • ,n} 
has occurred. We assume that faults are independent and 
identically distributed (i.i.d.), and that fault s occurs with 
known probability p. Thus, the (prior) probability of fault 
pattern x occurring is 

n 
s=l 

The fault pattern .t = corresponds to the null hypothesis, the 
situation in which no faults have occurred, with probability 
p{0) = (1 — p)", and the expected number of faults is 

En 
s=iP = np. 

We assume that m scalar real measurements, denoted by y, 
y 6 W"\ are available. These measurements depend on the 



fault pattern x G {0, 1}" linearly: 

y — Ax + V , 

where A e jjmx" jg ^jjg fault signature matrix, and the 
measurement noise v £ M™ is random, with Vi independent of 
each other and x, each with M{0,a'^) distribution. Typically 
the system of linear equation is under-determined (n > m), 
which means the number of potential faults is greater than the 
number of measurements. 

The fault signature matrix A is assumed to be known. Its s*'' 
column fls G M™ corresponds to the measurements, when only 
fault s has occurred and assuming no noise. For this reason 
is called the s"* fault signature. Since x is a Boolean vector. 
Ax is simply the sum of the fault signatures corresponding 
to the faults that have occurred. We further assume A to be 
sparse, i.e., to have relatively few non-zero values. This is the 
case in many appUcations, where each fault may affect only a 
small part (such as a single subsystem) of the whole ||9], ifTOl . 

mil, ina. 

B. Posterior probability 

Let p{x\y) denote the (posterior) probability of pattern x 
given the measurements y. By Bayes rule we have 

p{x\y) Qc p{y\x)p{x) ^ p{x,y) 

^N{y Ax,a^I)\{p-^{l-pf--^ . (1) 

By letting C and C" indicate constants (values independent 
of x), we can define the log-loss function as the negative log 
probability ly{x) = — \ogp{x\y) and write 

m 

ly{x) = X^x + 2^ - Ax)'^{y - Ax) + C 

i=l 

^^x^A^Ax + {\-c7-^A^yfx + C', (2) 

where As = log((l ~p)/p) denotes the log-odds ratio. Note 
that (|2]l is a convex quadratic function of x (a binary-valued 
vector), and is thus a convex integer quadratic program. 

C. Graphical Models 

Graphical models are used to represent and exploit the 
structure of the cost function (|2]), or its associated probability 
distribution ([Hi, to develop efficient estimation algorithms. 
Specifically, we factor p[x) into a product of smaller functions, 
called factors, each of which is defined using only a few vari- 
ables. This collection of smaller functions is then represented 
as a factor graph G, in which each variable is associated with a 
variable node and each factor with a factor node. Factor nodes 
are connected to variable nodes that represent their arguments. 
Because ([H can be represented as the sum of terms involving 
at most two variables, 

hi^) ^ X! JstXsXt + '^hsXs , (3) 

s,t>s s 
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where 

Jst = ^iA'^A),t, 

and we can represent p{x\y) or ly{x) as a pairwise graph with 
an edge between Xg and xt if and only if Jst is non-zero. This 
corresponds to factoring p{x\y) as 

p{x\y) oc Jl Mxs,Xt)Y[9s{xs) 

i={s,t>s) s 

= exp{-JstXsXt)Y\_'i^p{-hsXs), 

{s,t>s) s 

where fi{-) and f/s(') are factors; we use the convention that 
functions g and indices s,t refer to local (single-variable) 
factors while / and indices i,j refer to higher-order (in this 
case, pairwise) factors. For compactness, we generically write 
f{x) to indicate a function / over some subset of the Xs- 

The graph structure is used to define efficient inference al- 
gorithms, including MAP estimation or marginalization. Both 
problems are potentially difficult, as they require optimizing or 
summing over a large space of possible configurations. How- 
ever, structure in the graph may induce an efficient method 
of performing these operations. For example, in sequential 
problems the Viterbi algorithm ET\ (and more generally, 
dynamic programming) provides efficient optimization, and 
can be represented as a message-passing algorithm on the 
graph G. 

In graphs with more complicated structure such as cycles, 
exact inference is often difficult; however, similar message- 
passing algorithms such as loopy belief propagation perform 
approximate inference Il22l . The max-product algorithm is 
a form of BP that generalizes dynamic programming to an 
approximate algorithm. One computes messages mf^ from 
factors to variables and from variables to factors, 

ml{x,) (xma.xf,{x) TT TO^Cxt) , 

TT f (4) 

ml,{xs) (X gs{xs) [[ m (xs ) , 

where messages are normalized for numerical stability, and 
Ts is the neighborhood of node s in the graph (all nodes for 
which di is non-zero, excluding s). One can also compute a 
"belief" bs{xs) about variable Xs, 

bs{Xs) = 9s{Xs) Yi "^(si^s) , (5) 

which can be used to select the configuration of Xs by choosing 
its maximizing value. If the graph G is singly-connected 
(no cycles), then the max-product algorithm is equivalent to 
dynamic programming. However, the algorithm performs well 
even in graphs with cycles, and has been shown to be highly 
successful in many problems, most notably the decoding of 
low-density parity check (LDPC) codes 1231 . Max-product 
and its so-called reweighted variants are closely related to 
linear programming relaxation techniques, but by exploiting 



the problem structure can be more efficient than generic linear 
programming packages ["24]. 

The sum-product formulation of BP is intended for ap- 
proximate marginalization, rather than optimization. Despite 
this, the sum-product algorithm has been frequently applied 
to MAP estimation problems, as it often exhibits better con- 
vergence behavior than max-product ||251 . It has an almost 
identical message-passing formulation, 

^Li^s) oc Mx) Y[ ^uixt) , 

x\x, ter,\s 

TT f ^ -* 

ml,{xs) OC gs{xs) [[ m (x^ ) . 

Again, one can estimate each Xg by choosing the value that 
maximizes the belief bs{xs), in this case corresponding to the 
maximum posterior marginal estimator Although the sum- 
product beliefs are intended to approximate the marginal 
distributions of each Xg, if the most likely joint configurations 
all share a particular value for Xg, then this will be reflected 
in the marginal probability as well. 

Variants of BP typically rely on graph sparsity (few edges) 
to ensure both efficiency and accuracy. When the graph has 
no cycles, these algorithms are exact; for graphs with cycles, 
they are typically only approximate but are often accurate in 
systems with long, weak, or irregular cycles [26|. Unfortu- 
nately, although the fault signature matrix A may be sparse, the 
same is typically not true of the matrix J = A'^A, especially 
for large m and n. In the dense case, a direct application 
of BP to Q may fail. For example, in experiments, with 
A sized 100 x 200 and approximately 10% non-zero values 
±1, A'^A is approximately 50% non-zero, and BP algorithms 
defined using (O did less well than the current state of the art 
methods. As the dimension sizes m and n are increased, A^A 
becomes dominated by non-zeros. This motivates us to define 
an alternative graphical model that relies on the sparsity of A 
itself. 

III. Belief Propagation for Fault Identification 

Since the structure of the matrix A is sparse, let us define an 
alternative graphical model that uses A explicitly. In particular, 
we can write the probability distribution 

p{x,y) = Y[Myt,x)Y[9s{xs) , 

i s 

in terms of the factors, 

Mx)^^fid,x■,y,,a^), g,(a:,)=p^=(l-p)l-^^ (7) 

Notably, if A is sparse (specifically, if each row di is sparse), 
then the factors fi will depend on only the few Xg for which 
di is non-zero and the graph representing this factorization 
will be sparse as well. An example factor graph is shown in 
Fig.Ea). 

Using either max-product (|4]l or sum-product (|6]l on the 
resulting factor graph is computationally difficult, as it in- 
volves eliminating (maximizing or summing over) all expo- 
nentially many configurations of the neighboring variables 
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of fi (2^ evaluations for factors over d variables). This can 
quickly become intractable for even moderate neighborhood 
sizes. Although the relationships among the Xg and j/i are 
simple and linear, our model defines a hybrid distribution over 
both continuous valued and discrete valued random variables. 
Observing yi creates a combinatorial dependence among the 
neighboring discrete-valued Xg- 

Although it may seem counter-intuitive, we can avoid some 
of these difficulties by converting the graphical model to a 
fully continuous model. We abuse notation slightly to define 
continuous variables Xg in place of their discrete counterparts, 
with corresponding factors 

gsixs) = Ps5{xs = 1) + (1 - Ps)5{xs = 0) . 

It will also prove convenient to relax the prior slightly into a 
Gaussian form, using the approximation 

gs(xs) ^ PsN{xs;l,v) + {I - ps)N{xs;Q,v) , (8) 

where the variance v controls the quality of the approximation; 
as ;/ — > 0, we recover the original prior on Xg. The factors /;(■) 
and gs{ ) are illustrated in Fig. [Hb). 

The BP message-passing algorithm remains applicable to 
models defined over continuous random variables. However, 
message representation is more difficult as we must represent 
continuous functions over those variables. For non-Gaussian 
models, message representation typically requires some form 
of approximation. 

In our continuous model, both sets of factors fi and gs 
consist of mixtures of Gaussians; thus their product, and the 
messages computed during BP, will also be representable as 
mixtures of Gaussians ll22l . ifTTl . Unfortunately, at each step 
of the algorithm, the number of mixture components required 
to represent the messages will increase at an exponential rate, 
and must be approximated by a smaller mixture. To handle the 
exponential growth of mixture components, approximations 
to BP over continuous variables were proposed in NBP l,22J 
and variants in a broad array of problem domains lIZTl . ifTTl . 
[[TS'l, [16|. Those approximations use sampling to limit the 
number of components in each mixture. A number of sampling 
algorithms have been designed to ensure that this process is 
as efficient as possible f28l, 1291 , ll30l . Such sample based 
representations are particularly useful in high dimensional 
spaces, where discretization becomes computationally diffi- 
cult. Various authors have also proposed message approxima- 
tion methods based on dynamic quantization or discretization 
techniques 1311. Il32l. 

For the fault identification problem our variables Xs are 
one dimensional and can be reasonably restricted to a finite 
domain, for example Xs £ 3i^, 1 + 3;^] where v is the 
assumed standard deviation of the noisy linear system, and 0,1 
are the binary fault values. Thus, it is both computationally 
efficient and sufficiently accurate to use a simple uniform 
discretization over possible values, allowing our functions 
over Xs to be represented by fixed-length vectors |16|, |17|. 
Although typically the term "non-parametric BP" refers to 
an algorithm using stochastic samples and Gaussian mixture 
approximations to the messages, rather than a fixed discrete 



quantization, here we use it more generically to distinguish 
BP in our fully continuous, Gaussian mixture model from a 
standard discrete BP performed directly on (|3]i. 

IV. Implementation 

In this section we discuss some details of our implemen- 
tation of NBP and our overall fault identification algorithm. 
These include several techniques that make our algorithm more 
efficient, and two local optimization heuristics which may 
improve solution quality. 

A. Computation of messages 

We apply the BP algorithm presented in (|6]l using the 
factors (|7]l and the self potentials defined by the Gaussian 
approximation (HJ. The variable-to-factor messages are given 
by the usual formula. 



(9) 



and we compute the message product more efficiently by 
noting that 



9s 



n ' 



gs W m^js I 



/ 

is ' 



(10) 



where g^ is a shorthand to gs{xs) and the same for mj^. The 
product on the r.h.s of (fTol l can be computed once and reused 
for each outgoing message 1331 . ||271 . 

The factor-to-variable messages require a more detailed 
analysis. It is important to note that our factors fi(x) are 
functions of several variables, say = {x^^ . . . x^^}. A 
message computation, for example from factor i to variable 
si, can be explicitly written as 



/ 

IS\ 



fi {Vi ^ ^ ^isj Xs 

i=i 



J=2 



dxTi ■ 

(11) 



Although an arbitrary function on d variables would require 
0{n'^) computation, where n is the number of discretization 
bins, the messages can be computed more efficiently, because 
fi is a function of a linear combination of the Xs f34l, fTTl. 
In essence, one uses a change of variables to separate each 
integrand, defining variables for the cumulative sum, Xs^ ~ 
cis^Xsj +5sj+i, and scaled messages fhs^i{x) ^ ms-i{x Idis-). 
We re-express (fTTT i. 



fii^^isi^si ^^52) / ^S2i (-^52 "^ss) 



.ms^iixs^)dxri, 



in which each integral (approximated by a discrete sum) 
requires 0{n^) computation. Note also that, computing from 
the right-hand side, each step takes the form mj{xj) = 
J rri{xj — Xjj^\)mjj^\{xjj^\)dxjj^\, and can thus be thought 
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Fig. 1. Graphical model for the fault identification problem, (a) Factor graph representing dependency among variables, including faults {xs } and measurements 
{yi\- Graph edges indicate non-zero values of ciis- (b) Potential functions Cjs capturing the binary and sparse nature of the Xa, and fi representing the linear 
measurements with Gaussian noise. 



of as a convolution operator, m{xj) m{xj+i) lf35l . For 
convenience, we write this as 

ml^i^s) cc fi{aisXs)® m'"ti{xt/ait) . (12) 

ter,\s 

Moreover, since convolution can be computed by an element- 
wise product in the Fourier domain, the factor-to-variable mes- 
sages can be evaluated even more efficiently by first rescaling 
each incoming message, transforming into the Fourier domain, 
taking a product, transforming back, and unsealing, resulting 
in the update rule. 




where F and are the discrete Fourier and inverse Fourier 
transforms, respectively. Again, the products are computed 
more efficiently using all terms and dividing as in (fTol i. 

This highlights the basic advantage over a formulation in 
which the Xg are explicitly discrete-valued. Although the exact 
calculations are exponential over the degree of the factor fi in 
both cases, the continuous-valued formulation provides us the 
opportunity to approximate the intermediate quantities (in our 
implementation, using a discretization) and separates the com- 
putation into a simple and efficiently computed form (fT3T l. We 
note that the rescaling step is equivalent to the stretch/unstretch 
operations proposed in LDLC decoding ifTTl . Additionally, if 
the scale factors an are bipolar (±1), then rescaling becomes 
trivial (e.g., for —1, the vector is reversed), and the algorithm 
simplifies further. 

We proceed by computing all messages from variables to 
factors according to (|9]l, then computing all messages from 
factors to variables by ( fT2] i. The algorithm is run for a 
predetermined number of iterations, or until convergence is 
detected locally. For detecting convergence, we use the ^2 
norm of the product of all incoming messages in this round, 
relative to the product of all incoming messages in previous 
rounds. Finally, each variable node Xg computes its belief, and 



we estimate its value by rounding to the closest fault value 
(either zero or one), 

Xs = round argmax-! gs{xs) TT m{ {xg) \ ■ (14) 

B. Discretization of the Gaussian mixture 

Recall that messages are Gaussian mixtures representing 
the posterior and are discretized for efficiency. To store each 
message, we allocate a vector of a fixed length q. Entries in 
this vector are real positive values. Typically we used values 
of q in the range 512 — 1024. A higher value of q makes the 
algorithm more accurate but slows execution time. We evaluate 
the messages at fixed intervals A within a range of interest, 
identical for each variable. This range should include, for 
example, both and 1 when the fault pattern is binary and the 
actual real-valued measurements; because of noise, the range 
of discretization is further increased to include several standard 
deviations of uncertainty. We used the following formula to 
determine the scope of the discretization, 

i? = max{max{|yi|}, 1} +3;/, 

i 

and centered each message around zero, i.e., the range 
[— i?, i?]. This last point is useful for computing the scaling 
operation on negative-valued edges, by first computing the 
scale, then reversing the output around zero. 

C. Local optimization procedures 

We outline two useful heuristics proposed by Zymnis et 
al. fSl that are used for improving the quality of our solutions: 
variable threshold rounding and a local search procedure. The 
purpose of these heuristics is to obtain an integer solution out 
of the fractional solution obtained using our BP solver. 

a) Variable threshold rounding: Let x* denote a soft 
decision, a vector of the same length as x whose entries 
are real-valued (rather than binary); x* may correspond, for 
example, to the BP belief or to the optimal point of a convex 



SUBMITTED TO IEEE TRANS. ON SIGNAL PROCESSING 



6 



Group 


Algorithm 


Abbreviation 


Prior on x 


BP 


NBP Solver (current work) 
Max-product BP (current work) 
Compressed sensing Belief Propagation 1161 
Low density lattice decoder 1 17 


NBP 

Max-prod 

CSBP 

LDLC 


binary and sparse 
binary and sparse 
sparse 
binaiy 


LP 


Interior point (Newton method) [8] 
Semidefinite programming! 36), |37| 


IP 

SDP 


^ — n 

X e [0, 1] 

X e [0, 1] 


CS 


Iterative signal recovery |381 

Gradient Projection for Sparse Reconstruction | 39 | 

Iterative hard thresholding |40| 


CoSaMP 

GPSR 

haidIO 


sparse 
sparse 
sparse 


Other 


All zero hypothesis 


Null 


X is constant 



TABLE I 

Algorithms used for comparison, grouped into general categories: belief propagation methods (BP), linear programming (LP), 

AND COMPRESSED SENSING (CS). 



relaxation of the original problem. Our task is to round the 
soft decision x* to obtain a valid Boolean fault pattern (or 
hard decision). Let G (0, 1) and set 

To create x, we simply round all entries of x* smaller than the 
threshold 9 to zero. Thus is a threshold for guessing that a 
fault has occurred, based on the relaxed MAP solution x* . As 
9 varies from to 1, this method generates up to n different 
estimates x, as each entry in x* falls below the threshold. We 
can efficiently find them all by sorting the entries of x*, and 
setting the values of each Xg to one in the order of increasing 
X*. We evaluate the loss for each of these patterns, and take 
as our best fault estimate the one that has least loss, which we 
denote by x"*". 

b) Local search: Further improvement of our estimate 
can sometimes be obtained by performing a local search 
around x^. We use a simple search: initializing x to x+, we 
cycle through indices s — 1, . . . ,n, where at step s we replace 
Xs with 1 — is- If this leads to a reduction in the loss function, 
then we accept the change and continue. Otherwise, we restore 
Xs to its original value and move on to the next index. We 
continue in this manner until we have rejected changes in all 
entries of x. At the end of this search, x is at least 1-OPT, 
which means that no change in any one entry will improve 
the loss function. Numerical experiments show that this local 
optimization method often has no effect, which means that x+ 
is often 1-OPT. In some cases, however, it can lead to modest 
reduction of loss compared to x^. 

V. Numerical Examples 

A. Algorithms for comparison 

We have implemented our NBP solver using Matlab; our 
implementation is available online ||201 . Table H] lists the 
different algorithms we evaluated. We compared NBP to 
several groups of competing state-of-art algorithms. First, we 
considered the interior point method (IP) for solving the fault 
identification problem (E). Second, we evaluated two other 
variants of NBP: (/) CSBP [16J; and (//) the low density lattice 
decoder (LDLC) ITtI . Third, we ran several non-Bayesian CS 
algorithms: (/) CoSaMP f38l; (ii) GPSR f39l; and (///) iterative 
hard thresholding (HardIO) L40J . Fourth, we implemented a 



Bipolar 


Binary 


Transformation 


X' G {-1,1}" 
y = Ax + V 
min II Ax — y\\ 

s.t. X e {-1, 1}" 


xe {0,1}" 

y = (2A)x + V 
min (2j4)2: — y\\ 

s.t. X e {0, 1}" 


x= {x + l)/2 
y = y + Al 



TABLE II 

Transformation between bipolar and binary representations 



semidefinite programming relaxation ll36ll . ||37]| . Finally, for a 
MAP problem over binary variables, it is natural to employ 
the discrete BP max-product algorithm defined directly on the 
model ([3]l, and so we implemented this algorithm as well. 
In practice, it performed significantly worse than the other 
algorithms. 

A technical subtlety that arises when handling the various 
algorithms is that they use one of two equivalent formulations 
of the problem: either a Boolean or bipolar representation. 
TablelUoutlines the two models and the transformation needed 
to shift between them; we use the notation 1 for the all-ones 
vector of appropriate size. Without loss of generality we use 
the binary form when describing the algorithms. 

Within the BP group, max-product BP fails to exploit the 
sparsity of the measurement matrix, and CSBP assumes a 
sparse not necessarily binary signal. Our NBP uses the same 
framework as CSBP, but with the correct fault prior. As we 
show shortly, this results in a significant improvement in 
identifying the correct fault pattern. LDLC does assume a 
binary prior but assigns faults and non-faults equal probability, 
which degrades performance. Within the linear programming 
group, linear programming and semidefinite programming re- 
lax the binary fault prior into the continuous domain, returning 
fractional results, and sparsity of the fault pattern is not 
assumed or exploited. Within the CS group, all algorithms 
exploit the sparse nature of the signal, but not the sparse 
measurement matrix or the binary nature of the signal. 

B. Experimental settings 

We consider an example with m = 100 sensors, n = 200 
possible faults, and linear measurements. The matrix A is 
taken to be sparse with a fixed percentage of non-zero entries 
q where the non-zero entries of A are drawn from a single 
distribution which is either (j) Bernoulli, where elements of A 
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Fig. 2. Success rate of each method tested, as a function of the sparsity of A (fault signatures), (a) With local optimization heuristics, and (b) without local 
optimization heuristics. Best viewed in color. 
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Fig. 3. Success rate as a function of the prior fault probability (sparsity of faults), (a) With local optimization heuristics; (b) without local optimization. Best 
viewed in color. 



are chosen randomly and independently to be either —1 or 1; 
and (ii) Gaussian, where non-zero elements of A are chosen 
from a Gaussian distribution with zero mean and unit norm. 
The problem is under-determined in the sense that we have 
fewer measurements than potential faults, a similar setting to 
CS. 

These two matrix types correspond to models that com- 
monly arise in practice. Bipolar matrices often occur, for 
example, in fault identification of linear analog circuits [41 1, 
ll42l . while Gaussian matrices were used in fSl and also 
appear in the CS literature |43|. In both models, we fix the 
measurement noise standard deviation to cr = 1. Where not 
otherwise stated, the fault probability is p — 0.05 (giving 10 
faults on average), and the matrix sparsity is g = 0.05. Finally, 
we define a successful reconstruction as a run of an algorithm 
that resulted in a solution with likelihood greater than or equal 
to the true solution. 

C. Discussion of results 

We show two sets of results, varying the sparsity of the fault 
signatures (the matrix A) in Fig. |2] and varying the sparsity of 



the fault patterns (rarity of faults) in Fig. [3] In both cases, 
we compare all 10 algorithms listed in Table |T] under two 
conditions: with local optimization heuristics and without. At 
each sparsity level, the plots show average performance over 
500 experiments. 

We first note the effect of the local optimization heuristics 
on the quality of the solutions. In both figures, the optimization 
heuristics have a significant positive effect, and this effect is 
particularly powerful when the probability of a fault is low. 
For example. Fig. |2ja)-(b) corresponds to fault probability 
0.05 across a range of fault signature sparsities, and for 
many techniques the local optimization dramatically improves 
performance (note the change in scale between the two plots). 
Even the null hypothesis (which simply guesses no faults) 
attains 96% success in identifying the correct faults with local 
optimization, since most runs are well explained by only a few 
faults. As the fault patterns become denser, local optimization 
becomes less powerful; see, for example, the performance of 
the null hypothesis in Fig. Oa), which decreases rapidly as 
faults become more common. 

Since most algorithms did very well on such spai^se fault 
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patterns with local optimization, we focus for the moment 
on the algorithms' performance without local heuristics. In 
Fig. |2|b), we can distinguish between 3 general groups of 
algorithms. First, the LDLC algorithm has a relatively low 
success rate, because it uses a uniform prior distribution of 
faults vs. non-faults. Second, the non-Bayesian CS algorithms 
(CoSaMP, GPSR and hardIO) do not account for the Boolean 
nature of x and thus have inferior performance. Third, NBP, IP, 
and SDP are all targeted at our fault identification problem, and 
designed to perform well for problems that have both sparse 
and Boolean-valued fault patterns. CSBP also performed sim- 
ilarly to this third group. In all cases, NBP performs as well 
or better than the other tested methods. 

We next evaluate the effect of fault pattern sparsity on solu- 
tion quality. Fig. [3] shows the performance of each algorithm, 
this time when the sparsity level of A is fixed to be 0.05, and 
the fault prior varies between 3% and 15%. When the fault 
prior increases, the problem becomes harder, because there are 
an increasing number of a priori likely fault combinations. 
For example, with a prior of 1/n we have on average only 
one fault, with n possible locations; when the prior is 2/n 
there are on average two faults and n{n — l)/2 possible 
locations, and so on. Increasing the prior fault probability 
shows a clear separation between the performance of the 
different algorithms, in which NBP outperforms IP in all cases 
(both with and without local optimization). 

Finally, we illustrate the convergence of NBP, CSBP, LDLC, 
and IP in Fig. |4l where m = 10, n — 15, and two faults oc- 
curred. Each axis indicates the belief or soft decision about one 
of the two (correct) faults. Note that all algorithms converge 
to solutions greater than 0.5, and will thus round to the correct 
solution in post-processing. The NBP algorithm converges in 
two iterations to the correct solution while IP requires more 
iterations for converging to an approximate solution, although 
we should be cautious in giving this conclusion too much 
weight, since the computational cost per iteration of the NBP 
algorithm is higher NBP converges accurately, because the 
prior distribution encourages it to converge to zero or one. 
CSBP converges to some positive but non-integral value, while 
LDLC also encourages binary values and is therefore the 
closest after NBP. Indeed, we have noticed in our experiments 
that NBP often converges faster than other algorithms. When 
many variables are involved, slow convergence or convergence 
to non-integral values may indicate that incorrect values are 
used to estimate other variables, making the overall process 
less accurate than when information on the binary nature is 
included (as in NBP). 

Overall, our experiments show that NBP has better perfor- 
mance for fault identification than other tested algorithms in 
a variety of scenarios, including different sparsity levels of 
the matrix A, different fault priors, and irrespective of the 
usage of local optimization procedures. As complementary 
material to this paper, we provide the Matlab source code of 
our experiments online L20J . 
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VI. Information Theoretic Characterization 

Why does NBP perform so well relative to state-of-the-art 
algorithms? A partial answer can be obtained by considering 
recent information theoretic results in the related domains of 
multi-user detection O, ||45l and CS Egl, 133. Consider the 
signal and measurement models as before, where y — Ax + v, 
Xs is i.i.d. Bernoulli with Y'i{xs ~ 1) — p, and v is i.i.d. zero 
mean unit norm Gaussian. Suppose also that the measurement 
matrix A is assumed to be i.i.d. where the probability of non- 
zero entries is q, and that non-zero matrix entries follow some 
unit norm distribution. We note in passing that the non-zero 
probability q must scale to zero as m and n grow; see details 
in Guo and Wang ||45J . 

Following the conventions of Guo et al. |I46|, ED, EH, the 
signal to noise ratio (SNR) 7 can be computed as, 

m 
—■ 

r 

We also define a distortion metric D that evaluates the 
approximation error between x and x by averaging over per- 
element distortions, 

1 

D{x,x) = - d(xs,Xs), 

s=l 

where d{-,-) is a distortion metric such as square error or 
Hamming distortion. For this problem where a sparse mea- 
surement matrix A is used, Guo and Wang f45l provided 
the fundamental information theoretical characterization of the 
optimal performance that can be achieved, namely the minimal 
D that can be achieved as m and n scale to infinity with fixed 
aspect ratio 6, i.e., lim„^oo — = S. 

There are several noteworthy aspects in the analysis by Guo 
and Wang. First, Theorem 1 |45| proves that the problem 
behaves as if each individual input element Xg were estimated 
individually from a corrupted version Zg, with Zg = Xg + Wg 
where Wg is Gaussian noise. That is, the vector estimation 
problem is related to an estimation problem defined over scalar 
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Gaussian channels. Second, Theorem 2 B3]| demonstrates that 
the amount of scalar Gaussian noise in the random variable 
Ws can be computed by finding the fixed point for 77 of the 
following expression, 

77 =1H -mmse(p, 777), (15) 

po 

where ry e (0,1) is the degradation in SNR, i.e., the SNR 
of the scalar channel is 7/7 instead of 7, and mmse(p, 7/7) 
is the minimum mean square error (MMSE) for estimating 
the random variable Xg ^ Ber(p) from the output Zg of 
the dual scalar channel whose SNR is 777. That is, the scalar 
Gaussian channels are degraded. Combining these insights, 
the fundamental limiting performance can be computed for 
any distortion metric D by examining the output of the 
degraded scalar channel with SNR rjj. Finally, Guo and Wang 
demonstrate that in the large system limit (when m and n 
scale to infinity) with fixed aspect ratio 6, the posteriors 
estimated by BP converge in distribution to the true posteriors. 
Consequently, it should be no surprise that numerical results 
in Section [V] indicate that NBP outperforms other techniques 
for fault identification in practice. 

Although the fundamental performance of fault identifica- 
tion is well understood when A is sparse and i.i.d., when 
A is not sparse the analysis becomes more involved. In the 
non-sparse case, replica method analyses have been used to 
derive information theoretic characterizations in a less rigorous 
fashion; see Guo and Verdu It44il or Guo and Tanaka I48J . 

VII. Discussion 

In this paper, we proposed a novel approach for solving 
the fault identification problem using non-parametric belief 
propagation. By exploiting prior information about faults, 
we are able to improve significantly our ability to correctly 
identify fault patterns. Our main improvement arises from 
accounting for both the binary nature and prior probability 
of the faults. 

A possible extension is the inclusion of non-i.i.d. noise. In 
principle, this can be dealt with by augmenting the model in 
Fig. [Tfa) with a graph structure over measurements y reflecting 
the inverse covariance of the noise. We would expect our 
approach to continue to do well for relatively sparse inverse 
covariances, but an exact characterization is left for future 
study. 

Another extension would be to consider non-binary faults. 
As long as the fault pattern is i.i.d., the recent information 
flieoretic results of Guo et al. ||46|, ||45l, |l48l indicate that 
belief propagation should continue to perform well. 

As a final note, we mention that our proposed algorithm 
is distributed, since its underlying BP algorithm can be nat- 
urally distributed over multiple nodes, and works well when 
the matrix A is sparse. In a network where communication 
is costly, our algorithm offers the additional advantage of 
requiring fewer communication rounds. 
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